Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations (2401.00162v3)
Abstract: The sparsity of reward feedback remains a challenging problem in online deep reinforcement learning (DRL). Previous approaches have used offline demonstrations to achieve impressive results on multiple hard tasks. However, these approaches place high demands on demonstration quality, and obtaining expert-like actions is often costly or unrealistic. To tackle these problems, we propose a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG), which leverages a small set of state-only demonstrations (demonstrations that contain no expert action information) to indirectly perform approximate yet feasible long-term credit assignment and to facilitate exploration. Specifically, we first design a trajectory-importance evaluation mechanism that scores the quality of the current trajectory against the demonstrations. We then introduce a guidance-reward computation technique, based on trajectory importance, that measures the impact of each state-action pair and fuses the demonstrator's state distribution with reward information into the guidance reward. We theoretically analyze the performance improvement induced by smooth guidance rewards and derive a new worst-case lower bound on this improvement. Extensive results demonstrate POSG's significant advantages in control performance and convergence speed across four sparse-reward environments: a grid-world maze, Hopper-v4, HalfCheetah-v4, and an Ant maze. Notably, we report specific metrics and quantifiable results to demonstrate POSG's superiority.
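The two steps sketched in the abstract (scoring a rollout against state-only demonstrations, then converting that score into dense per-step guidance rewards) can be illustrated with a minimal sketch. The snippet below is not the paper's exact method: it assumes an RBF-kernel maximum mean discrepancy (MMD) as the state-distribution distance, an exponential closeness weighting with a hypothetical coefficient `beta`, and a uniform spreading of the trajectory score over its steps; the actual importance and guidance-reward formulas in POSG may differ.

```python
# Illustrative sketch only: RBF-kernel MMD as the state-distribution distance and
# uniform per-step spreading of the trajectory score are assumptions, not the
# paper's exact formulation.
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    # Pairwise RBF kernel between two state sets of shape (n, d) and (m, d).
    sq = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2.0 * x @ y.T
    return np.exp(-sq / (2.0 * bandwidth**2))

def mmd(states_a, states_b, bandwidth=1.0):
    # Squared maximum mean discrepancy between the two empirical state distributions.
    return (rbf_kernel(states_a, states_a, bandwidth).mean()
            + rbf_kernel(states_b, states_b, bandwidth).mean()
            - 2.0 * rbf_kernel(states_a, states_b, bandwidth).mean())

def trajectory_importance(traj_states, traj_return, demo_states, beta=5.0):
    # Step 1: score the current rollout by how close its visited states are to the
    # state-only demonstration, modulated by the sparse environment return.
    closeness = np.exp(-beta * mmd(traj_states, demo_states))
    return closeness * (1.0 + traj_return)

def guidance_rewards(traj_states, traj_return, demo_states):
    # Step 2: spread the trajectory-level importance evenly over all state-action
    # pairs, yielding a dense, smooth guidance reward for policy optimization.
    w = trajectory_importance(traj_states, traj_return, demo_states)
    return np.full(len(traj_states), w / len(traj_states))

# Usage: dense guidance rewards for a 100-step rollout in a 4-dimensional state space.
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 4))      # state-only demonstration (no actions)
rollout = rng.normal(size=(100, 4))  # current on-policy trajectory
dense_r = guidance_rewards(rollout, traj_return=0.0, demo_states=demo)
print(dense_r.shape, dense_r[:3])
```

In this reading, the guidance reward depends only on the demonstrator's state distribution and the sparse return, never on expert actions, which is what allows learning from state-only demonstrations.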
Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. 
In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. 
nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. 
Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. 
Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. 
nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. 
Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. 
ArXiv abs/1907.08027 (2019)
(22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: RUDDER: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019)
(23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020)
(24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep Q-learning from demonstrations (2018)
(25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017)
(26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019)
(27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR
(28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
(29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017)
(30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR
(31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018)
(32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022)
(33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR
(34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019)
(35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR
(36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016)
(37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep Q-networks, pp. 2433–2439 (2018)
(38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019)
(39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR
(40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021)
(41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020)
(42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
(43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should I run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021)
(44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings
(45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: ICML, vol. 1, p. 2 (2000)
(46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: AAAI, vol. 8, pp. 1433–1438 (2008)
(47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29 (2016)
(48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)
(49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
(50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: ICML, vol. 99, pp. 278–287 (1999)
(51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022)
(52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021)
(53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems 31 (2018)
(54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems 19, 513–520 (2006)
(55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012)
(56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conference on Artificial Intelligence (2019)
(57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proceedings of the 19th International Conference on Machine Learning (2002)
(58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR
(59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press (2011)
(60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR
(61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019)
(62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv:1606.01540 (2016)
arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. 
ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
- Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K.O., Clune, J.: First return, then explore. Nature 590(7847), 580–586 (2021) (4) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Driessche, G.V.D., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of go with deep neural networks and tree search. Nature 529, 484–489 (2016) (5) Lillicrap, T., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. CoRR abs/1509.02971 (2016) (6) Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015). PMLR (7) Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR (8) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). 
PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Driessche, G.V.D., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of go with deep neural networks and tree search. Nature 529, 484–489 (2016) (5) Lillicrap, T., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. CoRR abs/1509.02971 (2016) (6) Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015). PMLR (7) Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR (8) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. 
ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). 
PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lillicrap, T., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. CoRR abs/1509.02971 (2016) (6) Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015). PMLR (7) Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR (8) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. 
In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015). PMLR (7) Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR (8) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. 
Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR (8) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. 
arXiv:1606.01540 (2016) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. 
nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. 
Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. 
Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. 
nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. 
Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. 
ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. 
In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). 
PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. 
arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. 
In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. 
In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. 
arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. 
ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
- Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Driessche, G.V.D., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of go with deep neural networks and tree search. Nature 529, 484–489 (2016) (5) Lillicrap, T., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. CoRR abs/1509.02971 (2016) (6) Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015). PMLR (7) Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR (8) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). 
PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lillicrap, T., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. CoRR abs/1509.02971 (2016) (6) Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015). PMLR (7) Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR (8) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. 
ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. 
In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015). PMLR (7) Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR (8) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. 
Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. 
nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. 
Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR (8) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. 
arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. 
In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 
1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. 
In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. 
nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. 
Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. 
In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. 
In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. 
arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. 
ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 
278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Lillicrap, T., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. CoRR abs/1509.02971 (2016) (6) Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015). PMLR (7) Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR (8) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. 
Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015). PMLR (7) Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR (8) Florensa, C., Duan, Y., Abbeel, P.: Stochastic neural networks for hierarchical reinforcement learning. In: International Conference on Learning Representations (2017) (9) Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. 
Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ha, D.R., Schmidhuber, J.: World models. 
ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). 
PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). 
PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. 
In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. 
Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. 
In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. 
ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. 
In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). 
PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. 
nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. 
Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. 
In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 
8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 
278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. 
nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. 
Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. 
Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. 
arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. 
ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 
8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. 
nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. 
Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. 
Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. 
In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. 
arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. 
ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 
8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. 
nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. 
Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. 
Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
arXiv:1606.01540 (2016) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. 
In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. 
arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. 
ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 
8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 
278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Zhang, J., Yu, H., Xu, W.: Hierarchical reinforcement learning by discovering intrinsic options. In: International Conference on Learning Representations (2021) (10) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Li, S., Zhang, J., Wang, J., Yu, Y., Zhang, C.: Active hierarchical exploration with stable subgoal representation learning. In: International Conference on Learning Representations (2022) (11) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Yang, T., Tang, H., Bai, C., Liu, J., Hao, J., Meng, Z., Liu, P., Wang, Z.: Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668 (2021) (12) Gangwani, T., Zhou, Y., Peng, J.: Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems 33, 822–832 (2020) (13) Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
(59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press (2011)
(60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR
(61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019)
(62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). 
PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. 
arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. 
In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. 
In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 
8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). 
PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. 
ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. 
In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. 
arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 
3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. 
Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? 
(2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. 
In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? 
(2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). 
Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? 
(2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. 
Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. 
In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. 
Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. 
Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). 
PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. 
ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. 
In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. 
ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. 
Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. 
In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). 
PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). 
Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 
2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. 
ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006)
- Guo, Y., Choi, J., Moczulski, M., Feng, S., Bengio, S., Norouzi, M., Lee, H.: Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020) (14) Jing, M., Ma, X., Huang, W., Sun, F., Yang, C., Fang, B., Liu, H.: Reinforcement learning from imperfect demonstrations under soft expert guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020) (15) Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards.
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. 
ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. 
Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. 
In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. 
arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 
8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). 
PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. 
arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. 
In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. 
In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. 
arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. 
ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Ha, D.R., Schmidhuber, J.: World models. ArXiv abs/1803.10122 (2018) (16) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). 
PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. ArXiv abs/1903.00374 (2019) (17) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). 
PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. 
In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. 
Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. 
In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. 
ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. 
In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. 
arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 
3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). 
PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). 
Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? 
(2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. 
In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? 
(2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). 
Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? 
(2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. 
Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T.P., Silver, D.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2019) (18) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. 
In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Harutyunyan, A., Dabney, W., Mesnard, T., Gheshlaghi Azar, M., Piot, B., Heess, N., van Hasselt, H.P., Wayne, G., Singh, S., Precup, D., et al.: Hindsight credit assignment. Advances in neural information processing systems 32 (2019) (19) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. 
In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Mesnard, T., Weber, T., Viola, F., Thakoor, S., Saade, A., Harutyunyan, A., Dabney, W., Stepleton, T.S., Heess, N., Guez, A., et al.: Counterfactual credit assignment in model-free reinforcement learning. In: International Conference on Machine Learning, pp. 7654–7664 (2021). PMLR (20) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. 
ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). 
Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. 
In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hung, C.-C., Lillicrap, T.P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., Wayne, G.: Optimizing agent behavior over long time scales by transporting value. Nature Communications 10 (2018) (21) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. 
ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. 
In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). 
PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. 
nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. 
Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. 
arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. 
In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 
8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 
278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). 
PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. 
Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. 
Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). 
Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. 
arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 
1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. 
Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
- Ferret, J., Marinier, R., Geist, M., Pietquin, O.: Credit assignment as a proxy for transfer in reinforcement learning. ArXiv abs/1907.08027 (2019) (22) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. 
Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. 
In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. 
arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). 
PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). 
Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. 
In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. 
In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). 
PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). 
Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). 
8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 
278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems 32 (2019) (23) Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. 
In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
(58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR
(59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press (2011)
(60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR
(61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019)
(62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? 
(2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
- Zhou, A., Jang, E., Kappler, D., Herzog, A., Khansari, M., Wohlhart, P., Bai, Y., Kalakrishnan, M., Levine, S., Finn, C.: Watch, try, learn: Meta-learning from demonstrations and rewards. In: International Conference on Learning Representations (2020) (24) Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations (2018) (25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017) (26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, pp. 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should I run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: ICML, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: AAAI, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: ICML, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proceedings of the 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv:1606.01540 (2016)
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. 
arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. 
In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. 
arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 
1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 
1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. 
Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
- Hester, T., Vecerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.P., Leibo, J.Z., Gruslys, A.: Deep Q-learning from demonstrations (2018)
(25) Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017)
(26) Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019)
(27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR
(28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
(29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017)
(30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR
(31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240 (2018)
(32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022)
(33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR
(34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019)
(35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR
(36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016)
(37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep Q-networks, pp. 2433–2439 (2018)
(38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019)
(39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR
(40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021)
(41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020)
(42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
(43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should I run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021)
(44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings
(45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: ICML, vol. 1, p. 2 (2000)
(46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: AAAI, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA
(47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29 (2016)
(48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)
(49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
(50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: ICML, vol. 99, pp. 278–287 (1999). Citeseer
(51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022)
(52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021)
(53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems 31 (2018)
(54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems 19, 513–520 (2006)
(55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer
(56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019)
(57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proc. 19th International Conference on Machine Learning (2002). Citeseer
(58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR
(59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press (2011)
(60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR
(61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019)
(62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv:1606.01540 (2016)
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 
8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 
278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). 
Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. 
In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
- Gulcehre, C., Le Paine, T., Shahriari, B., Denil, M., Hoffman, M., Soyer, H., Tanburn, R., Kapturowski, S., Rabinowitz, N., Williams, D., et al.: Making efficient use of demonstrations to solve hard exploration problems. In: International Conference on Learning Representations (2019) (27) Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR (28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep Q-networks. In: International Joint Conference on Artificial Intelligence, pp. 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should I run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning.
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: ICML, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: AAAI, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: ICML, vol. 99, pp. 278–287 (1999) (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems 31 (2018)
PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). 
Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. 
In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 
8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 
278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Libardi, G., De Fabritiis, G., Dittert, S.: Guided exploration with proximal policy optimization using a single demonstration. In: International Conference on Machine Learning, pp. 6611–6620 (2021). PMLR
(28) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
(29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017)
(30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR
(31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240 (2018)
(32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022)
(33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR
(34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019)
(35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR
(36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016)
(37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep Q-networks, pp. 2433–2439 (2018)
(38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019)
(39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR
(40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021)
(41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020)
(42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
(43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should I run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021)
(44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings
(45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: ICML, vol. 1, p. 2 (2000)
(46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: AAAI, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA
(47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29 (2016)
(48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)
(49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
(50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: ICML, vol. 99, pp. 278–287 (1999)
(51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022)
(52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021)
(53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems 31 (2018)
(54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems 19, 513–520 (2006)
(55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012)
(56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019)
(57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proceedings of the 19th International Conference on Machine Learning (2002)
(58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR
(59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press (2011)
(60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR
(61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019)
(62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 
8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 
278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016) (29) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). 
Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Cruz Jr, G.V., Du, Y., Taylor, M.E.: Pre-training neural networks with human demonstrations for deep reinforcement learning. arXiv preprint arXiv:1709.04083 (2017) (30) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. 
In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: International Conference on Machine Learning, pp. 2469–2478 (2018). PMLR (31) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. 
Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. 
In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. 
In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. 
arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). 
Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Sun, W., Bagnell, J.A., Boots, B.: Truncated horizon policy search: Combining reinforcement learning & imitation learning. ArXiv abs/1805.11240 (2018) (32) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. 
In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., Shakkottai, S.: Reinforcement learning with sparse rewards using guidance from offline demonstration. In: International Conference on Learning Representations (2022) (33) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 
4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. 
Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 
4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). 
PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. 
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? 
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? 
(2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. 
In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 
278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Oh, J., Guo, Y., Singh, S., Lee, H.: Self-imitation learning. In: International Conference on Machine Learning, pp. 3878–3887 (2018). PMLR (34) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. 
In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). 
PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. 
- Gangwani, T., Liu, Q., Peng, J.: Learning self-imitating diverse policies. In: International Conference on Learning Representations (2019) (35) Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should I run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: ICML, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: AAAI, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: ICML, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment.
arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proceedings of the 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 
8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). 
JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. 
In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 
627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 
19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 
278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). 
Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Pritzel, A., Uria, B., Srinivasan, S., Badia, A.P., Vinyals, O., Hassabis, D., Wierstra, D., Blundell, C.: Neural episodic control. In: International Conference on Machine Learning, pp. 2827–2836 (2017). PMLR (36) Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep Q-networks, pp. 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should I run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: ICML, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: AAAI, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: ICML, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
(2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). 
Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? 
(2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. 
Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: Model-free episodic control. arXiv preprint arXiv:1606.04460 (2016) (37) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lin, Z., Zhao, T., Yang, G., Zhang, L.: Episodic memory deep q-networks, 2433–2439 (2018) (38) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. 
Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. 
In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. 
Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. 
arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Lee, S.Y., Sungik, C., Chung, S.-Y.: Sample-efficient deep reinforcement learning via episodic backward update. Advances in Neural Information Processing Systems 32 (2019) (39) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). 
Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Hu, H., Ye, J., Zhu, G., Ren, Z., Zhang, C.: Generalizable episodic memory for deep reinforcement learning. In: International Conference on Machine Learning, pp. 4380–4390 (2021). PMLR (40) Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. 
Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 
9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. 
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. 
In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. 
In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. 
In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Le, H., Karimpanal George, T., Abdolshah, M., Tran, T., Venkatesh, S.: Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems 34, 30313–30325 (2021) (41) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. 
In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should i run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: Icml, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: Aaai, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in neural information processing systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: Icml, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). 
Citeseer
- Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., et al.: Never give up: Learning directed exploration strategies. In: International Conference on Learning Representations (2020) (42) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) (43) Kumar, A., Hong, J., Singh, A., Levine, S.: Should I run offline reinforcement learning or behavioral cloning? In: International Conference on Learning Representations (2021) (44) Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011). JMLR Workshop and Conference Proceedings (45) Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning. In: ICML, vol. 1, p. 2 (2000) (46) Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K., et al.: Maximum entropy inverse reinforcement learning. In: AAAI, vol. 8, pp. 1433–1438 (2008). Chicago, IL, USA (47) Ho, J., Ermon, S.: Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29 (2016) (48) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020) (49) Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018) (50) Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: ICML, vol. 99, pp. 278–287 (1999). Citeseer (51) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
arXiv:1606.01540 (2016) Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022) (52) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021) (53) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). 
PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018) (54) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? 
(2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. 
Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006) (55) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. 
In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? 
(2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
arXiv:1606.01540 (2016) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012). Citeseer (56) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019) (57) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: In Proc. 19th International Conference on Machine Learning (2002). Citeseer (58) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. 
In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR (59) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, ??? (2011) (60) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR (61) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019) (62) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Pocius, R., Isele, D., Roberts, M., Aha, D.: Comparing reward shaping, visual hints, and curriculum learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
- Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations: Theory and application to reward shaping. In: ICML, vol. 99, pp. 278–287 (1999). Citeseer
- Zheng, Z., Vuorio, R., Lewis, R., Singh, S.: Adaptive pairwise weights for temporal credit assignment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 9225–9232 (2022)
- Raposo, D., Ritter, S., Santoro, A., Wayne, G., Weber, T., Botvinick, M., van Hasselt, H., Song, F.: Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425 (2021)
- Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. Advances in neural information processing systems 31 (2018)
- Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in neural information processing systems 19, 513–520 (2006)
- Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., Sriperumbudur, B.K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems, pp. 1205–1213 (2012)
- Masood, M.A., Doshi-Velez, F.: Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In: International Joint Conferences on Artificial Intelligence Organization (2019)
- Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proceedings of the 19th International Conference on Machine Learning (2002)
- Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: International Conference on Machine Learning, pp. 22–31 (2017). PMLR
- Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press (2011)
- Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016). PMLR
- Oh, J., Guo, Y., Singh, S., Lee, H.: Generative adversarial self-imitation learning. In: International Conference on Learning Representations (2019)
- Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)
- Guojian Wang
- Faguo Wu
- Xiao Zhang
- Tianyuan Chen